feat: named axis for ak.Array #3238

Merged: 57 commits, Oct 11, 2024

pfackeldey
Collaborator

@pfackeldey pfackeldey commented Sep 12, 2024

Proposal for named axis

This PR addresses #2596.

References for other named axis implementations:

Motivation

As argued at PyHEP.dev 2023 and by the Harvard NLP group in their "Tensor Considered Harmful" write-up, named axes can be a powerful tool for making code more readable and less error-prone.

Design

ak.Array with named axis

Named axes are implemented through a mapping from named axes to positional axes.
A named axis can be any hashable (currently restricted to strings) except integers and None, which are reserved for positional axes.

import typing

# typing.TypeAlias requires Python 3.10+
AxisName: typing.TypeAlias = typing.Hashable
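
The naming rule above can be sketched as a small check. This is only an illustration: `is_valid_axis_name` is a hypothetical helper, not a function from Awkward Array, and it assumes the current restriction to strings.

```python
from collections.abc import Hashable


def is_valid_axis_name(name: object) -> bool:
    """Hypothetical check for whether `name` may be used as a named axis."""
    # Integers (including bool) and None are reserved for positional axes.
    if name is None or isinstance(name, int):
        return False
    # A name has to be hashable so it can serve as a dict key.
    if not isinstance(name, Hashable):
        return False
    # Assumption taken from the text: names are currently restricted to strings.
    return isinstance(name, str)


print(is_valid_axis_name("events"))  # True
print(is_valid_axis_name(0))         # False
print(is_valid_axis_name(None))      # False
```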

By default, an ak.Array uses positional axes, but named axes can be added to the array in the following ways:

import awkward as ak

# tuple:
#   positional axis: (0, 1)
#   named axis: {"events": 0, "jets": 1}
array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# dict:
#   positional axis: (0, 1)
#   named axis:  {"events": 0, "jets": 1}
array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis={"events": 0, "jets": 1})

# the dict interface allows naming a single axis, including by a negative positional axis
array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis={"jets": -1})

# attach axis naming to an existing array
array = ak.Array([[1, 2], [3], [], [4, 5, 6]])
array = ak.with_named_axis(array, ("events", "jets"))
# or
array = ak.with_named_axis(array, {"events": 0, "jets": 1})

The named_axis argument of the ak.Array constructor is either a tuple of AxisName or a dict mapping AxisName to integers.
It is stored in the .attrs attribute of the array under the reserved key "__named_axis__" as a dict[AxisName, int].
The two kinds of axes can be accessed through the named_axis and positional_axis properties (in the example that follows, named_axis is a dict and positional_axis is a tuple):

import awkward as ak

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))
array.named_axis
>>>  {"events": 0, "jets": 1}
array.positional_axis
>>> (0, 1)

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis={"jets": -1})
array.named_axis
>>> {"jets": -1}
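
How the two accepted forms of named_axis could be brought into the stored dict[AxisName, int] form can be sketched as follows. `normalize_named_axis` is a hypothetical name used for illustration only; the actual conversion inside Awkward Array may differ in its checks and error types.

```python
def normalize_named_axis(named_axis):
    """Hypothetical: turn a tuple or dict of axis names into {name: position}."""
    if isinstance(named_axis, dict):
        mapping = dict(named_axis)
    elif isinstance(named_axis, tuple):
        # The index in the tuple is the positional axis; None leaves it unnamed.
        names = [name for name in named_axis if name is not None]
        if len(set(names)) != len(names):
            raise ValueError("axis names must be unique")
        mapping = {
            name: pos for pos, name in enumerate(named_axis) if name is not None
        }
    else:
        raise TypeError("named_axis must be a tuple or a dict")

    for name, pos in mapping.items():
        if not isinstance(name, str):
            raise TypeError(f"axis name must be a string, got {name!r}")
        if isinstance(pos, bool) or not isinstance(pos, int):
            raise TypeError(f"axis position must be an integer, got {pos!r}")
    return mapping


print(normalize_named_axis(("events", "jets")))  # {'events': 0, 'jets': 1}
print(normalize_named_axis({"jets": -1}))        # {'jets': -1}
print(normalize_named_axis((None, "Out")))       # {'Out': 1}
```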

Named axis in high-level functions

Named axes can be used in high-level functions, e.g. ak.sum, ak.max, etc.:

import awkward as ak

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# sum over the "jets" axis
sum_jets = ak.sum(array, axis="jets")
>>> ak.Array([3, 3, 0, 15])
sum_jets.named_axis
>>> {"events": 0}

# the `keepdims=True` argument keeps the named axis
sum_jets = ak.sum(array, axis="jets", keepdims=True)
>>> ak.Array([[3], [3], [0], [15]])
sum_jets.named_axis
>>> {"events": 0, "jets": 1}
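
The bookkeeping behind these two results can be sketched in plain Python. `named_axis_after_reduction` is a hypothetical helper, and the assumption that axes behind the reduced one shift down by one position is an illustration, not a statement about the internals of this PR.

```python
def named_axis_after_reduction(named_axis, axis, ndim, keepdims=False):
    """Hypothetical: named-axis mapping of the result of reducing over `axis`."""
    # Resolve a name to its position, then make the position non-negative.
    pos = named_axis[axis] if isinstance(axis, str) else axis
    if pos < 0:
        pos += ndim

    if keepdims:
        # The reduced dimension is kept (with length 1), so nothing changes.
        return dict(named_axis)

    result = {}
    for name, p in named_axis.items():
        p = p + ndim if p < 0 else p
        if p == pos:
            continue  # the reduced axis disappears together with its name
        # Assumption: axes behind the reduced one move down by one position.
        result[name] = p - 1 if p > pos else p
    return result


names = {"events": 0, "jets": 1}
print(named_axis_after_reduction(names, "jets", ndim=2))                 # {'events': 0}
print(named_axis_after_reduction(names, "jets", ndim=2, keepdims=True))  # {'events': 0, 'jets': 1}
```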

Named axes propagate to the resulting array in one of the following ways:

  1. Named axes are unchanged, e.g. ak.sum(array, axis="jets", keepdims=True) or array ** 2.
  2. Named axes are removed from the resulting array, e.g. through a reduction such as ak.sum(array, axis="jets").
  3. Named axes are unified, e.g. in binary operations on two ak.Array objects or in broadcasting:
import awkward as ak

array1 = ak.Array([[1, 2], [3, 4]], named_axis=("In", None))
array2 = ak.Array([[5, 6], [7, 8]], named_axis=(None, "Out"))

(array1 + array2).named_axis
>>> {"In": 0, "Out": 1}

Here, the named axes of both arrays are checked for compatibility. The rules are:

ak.Array([1], named_axis=("foo",)) + ak.Array([1], named_axis=("foo",))    # OK
ak.Array([1], named_axis=("foo",)) + ak.Array([1], named_axis=(None,))     # OK
ak.Array([1], named_axis=("foo",)) + ak.Array([1], named_axis=("bar",))    # raise Exception
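
The three rules can be written down as a short unification routine over per-dimension name tuples. This is a sketch under assumptions: `unify_named_axis` is a hypothetical name, both inputs are taken to have the same number of dimensions, and ValueError stands in for whichever exception the library actually raises.

```python
def unify_named_axis(left, right):
    """Hypothetical: unify two tuples of axis names, one entry per dimension."""
    if len(left) != len(right):
        raise ValueError("both inputs need the same number of dimensions")

    unified = []
    for a, b in zip(left, right):
        if a is None:
            unified.append(b)      # unnamed adopts the other name (or stays None)
        elif b is None or a == b:
            unified.append(a)      # same name, or only one side is named
        else:
            raise ValueError(f"named axes do not match: {a!r} vs {b!r}")
    return tuple(unified)


print(unify_named_axis(("In", None), (None, "Out")))  # ('In', 'Out')
print(unify_named_axis(("foo",), (None,)))            # ('foo',)
```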

Named axis in indexing

In addition, named axes can be used to select data:

import awkward as ak

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# select the first event
array[{"events": 0, "jets": slice(None)}]
>>> <Array [1, 2] jets:0 type='2 * int64'>

# select the first jet of each event
array[{"events": slice(None), "jets": slice(0, 1)}]
>>> <Array [[1], [3], [], [4]] events:0,jets:1 type='4 * var * int64'>

# mixed positional & named indexing
array[{0: slice(None), "jets": slice(0, 1)}]
>>> <Array [[1], [3], [], [4]] events:0,jets:1 type='4 * var * int64'>
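
One way to think about dict-based indexing is as a translation into an ordinary positional index tuple, with slice(None) filled in for every axis that is not mentioned. The helper below, `to_positional_index`, is hypothetical and only illustrates that translation; it is not the code path used by the PR.

```python
def to_positional_index(index, named_axis, ndim):
    """Hypothetical: translate {axis: item} into a positional index tuple."""
    slots = [slice(None)] * ndim  # axes that are not mentioned select everything
    seen = set()
    for key, item in index.items():
        # String keys are looked up by name, integer keys are positions.
        pos = named_axis[key] if isinstance(key, str) else key
        if pos < 0:
            pos += ndim
        if not 0 <= pos < ndim:
            raise IndexError(f"axis {key!r} is out of range for ndim={ndim}")
        if pos in seen:
            raise ValueError(f"axis {key!r} is addressed more than once")
        seen.add(pos)
        slots[pos] = item
    return tuple(slots)


names = {"events": 0, "jets": 1}
print(to_positional_index({"events": 0, "jets": slice(None)}, names, 2))
# (0, slice(None, None, None))
print(to_positional_index({0: slice(None), "jets": slice(0, 1)}, names, 2))
# (slice(None, None, None), slice(0, 1, None))
```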

For syntactic sugar, use np.s_:

import awkward as ak
import numpy as np

array = ak.Array([[1, 2], [3], [], [4, 5, 6]], named_axis=("events", "jets"))

# select the first jet of each event
array[{"events": np.s_[...], "jets": np.s_[0:1]}]
>>> <Array [[1], [3], [], [4]] events:0,jets:1 type='4 * var * int64'>

# or mixed with positional axis
array[{0: np.s_[...], "jets": np.s_[0:1]}]
>>> <Array [[1], [3], [], [4]] events:0,jets:1 type='4 * var * int64'>
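
For reference, np.s_ just returns whatever index expression is written between its brackets, so it is a shorthand for building slice objects by hand. The short check below uses only NumPy; the variable names are illustrative. Note that np.s_[...] produces Ellipsis, while np.s_[:] is the literal equivalent of slice(None).

```python
import numpy as np

first_jet = np.s_[0:1]       # same object you would get from slice(0, 1)
everything = np.s_[:]        # same as slice(None)
ellipsis_index = np.s_[...]  # the Ellipsis object, not a slice

print(first_jet == slice(0, 1))    # True
print(everything == slice(None))   # True
print(ellipsis_index is Ellipsis)  # True
```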

This PR has to touch a lot of code and needs to add custom named axis propagation to each high-level operation. Thus, this PR is currently in draft mode.

Looking forward to ideas, thoughts, feedback on this effort!

@pfackeldey pfackeldey changed the title Feat: named axis for ak.Array feat: named axis for ak.Array Sep 12, 2024
@pfackeldey
Collaborator Author

pfackeldey commented Sep 13, 2024

Progress

general

  • documentation for named axis

broadcasting

  • NumPy-style (right broadcasting)
  • Awkward-style (left broadcasting)
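
The difference between the two broadcasting styles, as far as axis names are concerned, can be pictured as a question of which side the shorter name tuple is padded on. The sketch below is an assumption-laden illustration: `align_axis_names` is a hypothetical helper, and how this PR actually aligns names during broadcasting is not shown in this conversation.

```python
def align_axis_names(names_a, names_b, direction="right"):
    """Hypothetical: pad two name tuples to equal length before unifying them."""
    if direction not in ("right", "left"):
        raise ValueError("direction must be 'right' or 'left'")
    ndim = max(len(names_a), len(names_b))

    def pad(names):
        fill = (None,) * (ndim - len(names))
        # NumPy-style (right) broadcasting aligns the trailing dimensions,
        # so missing dimensions are added in front. Awkward-style (left)
        # broadcasting aligns the leading dimensions instead.
        return fill + tuple(names) if direction == "right" else tuple(names) + fill

    return pad(names_a), pad(names_b)


print(align_axis_names(("x",), ("events", "jets"), direction="right"))
# ((None, 'x'), ('events', 'jets'))
print(align_axis_names(("x",), ("events", "jets"), direction="left"))
# (('x', None), ('events', 'jets'))
```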

slicing

  • positional axis only (e.g. array[0])
  • named axis only (e.g. array[{"events": 0}])
  • mixed positional and named axis (e.g. array[{0: 0, "jets": 0}])

Unary and binary operations

  • unary operations (e.g. array ** 2)
  • binary operations (e.g. array1 + array2)

high-level functions

New:

  • ak.with_named_axis
  • ak.without_named_axis

Can be used with named axis:

  • ak.all
  • ak.almost_equal
  • ak.angle
  • ak.any
  • ak.argcartesian
  • ak.argcombinations
  • ak.argmax
  • ak.argmin
  • ak.argsort
  • ak.array_equal
  • ak.backend
  • ak.broadcast_arrays
  • ak.broadcast_fields
  • ak.cartesian
  • ak.categories
  • ak.combinations
  • ak.concatenate
  • ak.copy
  • ak.corr (not yet supported)
  • ak.count
  • ak.count_nonzero
  • ak.covar (not yet supported)
  • ak.drop_none
  • ak.enforce_type
  • ak.fill_none
  • ak.firsts
  • ak.flatten
  • ak.imag
  • ak.is_none
  • ak.local_index
  • ak.mask
  • ak.max
  • ak.mean
  • ak.min
  • ak.moment
  • ak.nan_to_none
  • ak.nan_to_num
  • ak.num
  • ak.ones_like
  • ak.pad_none
  • ak.prod
  • ak.ptp
  • ak.ravel
  • ak.real
  • ak.round
  • ak.run_lengths
  • ak.singletons
  • ak.softmax
  • ak.sort
  • ak.std
  • ak.strings_astype
  • ak.sum
  • ak.to_packed
  • ak.unflatten
  • ak.values_astype
  • ak.var
  • ak.where
  • ak.with_field
  • ak.with_name
  • ak.with_parameter
  • ak.without_parameters
  • ak.zeros_like
  • ak.zip

Independent of named axes, this PR also fixes the following bugs and makes the following improvements:

  • various typos in doc strings
  • Indexing could contain multiple ... in certain cases, which NumPy prohibits (a705981)
  • keepdims argument in ak.corr and ak.covar was wrong for the mean calculation (00669a3)
  • avoid touching shape unnecessarily often when accessing .purelist_depth, .minmax_depth, and .branch_depth (typetracers) through self.inner_shape property of Numpy{Meta|Array} (1af4376)
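
The rule about multiple ... can be illustrated with a small validation step. `check_single_ellipsis` is a hypothetical helper written for this note, not the code from commit a705981; it mirrors NumPy, which raises an IndexError for an index such as a[..., ...].

```python
def check_single_ellipsis(index):
    """Hypothetical: count the Ellipsis entries and reject more than one."""
    items = index if isinstance(index, tuple) else (index,)
    count = sum(item is Ellipsis for item in items)
    if count > 1:
        raise IndexError("an index can only have a single ellipsis ('...')")
    return count


print(check_single_ellipsis((..., 0)))           # 1
print(check_single_ellipsis((0, slice(None))))   # 0
```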

@jpivarski
Member

And all the data types that can be passed into square brackets with __getitem__.

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 90.83527% with 79 lines in your changes missing coverage. Please review.

Project coverage is 82.27%. Comparing base (b749e49) to head (83c0aa6).
Report is 176 commits behind head on main.

Files with missing lines Patch % Lines
src/awkward/highlevel.py 45.91% 53 Missing ⚠️
src/awkward/_namedaxis.py 91.09% 17 Missing ⚠️
src/awkward/_broadcasting.py 92.85% 2 Missing ⚠️
src/awkward/operations/ak_without_named_axis.py 86.66% 2 Missing ⚠️
src/awkward/_regularize.py 92.30% 1 Missing ⚠️
src/awkward/_typing.py 50.00% 1 Missing ⚠️
src/awkward/contents/content.py 97.91% 1 Missing ⚠️
src/awkward/operations/ak_corr.py 87.50% 1 Missing ⚠️
src/awkward/operations/ak_covar.py 87.50% 1 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/__init__.py 97.14% <100.00%> (+0.08%) ⬆️
src/awkward/_connect/numexpr.py 91.76% <100.00%> (+0.97%) ⬆️
src/awkward/_connect/numpy.py 92.06% <100.00%> (+0.04%) ⬆️
src/awkward/_layout.py 87.56% <100.00%> (+1.31%) ⬆️
src/awkward/_nplikes/array_like.py 97.14% <ø> (+27.75%) ⬆️
src/awkward/_nplikes/typetracer.py 75.05% <ø> (+0.19%) ⬆️
src/awkward/_operators.py 94.91% <ø> (ø)
src/awkward/contents/numpyarray.py 90.50% <100.00%> (-1.01%) ⬇️
src/awkward/operations/__init__.py 100.00% <100.00%> (ø)
src/awkward/operations/ak_all.py 96.66% <100.00%> (+1.01%) ⬆️
... and 68 more

... and 84 files with indirect coverage changes

pfackeldey and others added 18 commits September 13, 2024 16:47
…x named axis propagation in indexing for type tracers
…ak.mean; remove inplace addition of arrays from test
…tible with branched structures;fix regularize_axis in all highlevel ops
@pfackeldey
Collaborator Author

Hi @agoose77,
yes, it is actually (almost) ready for review.
It only fails the win32 tests, which I'm sure is unrelated to named axes: I added a test that unveiled a problem in broadcasting for win32 environments, and I'm very sure this also goes wrong for arrays without named axes. I marked this test with pytest.xfail, but the test still doesn't pass. I don't understand why...
I'll merge main and satisfy the pylint test, then it should be ready for review 👍

@pfackeldey
Collaborator Author

pfackeldey commented Oct 7, 2024

The PR is ready for review @agoose77 and @jpivarski. If you have an idea why the Windows tests are failing, please let me know... (I thought xfail would mark them as expected to fail and thus the test would pass?)

I'm sorry that the PR got so huge, but the largest part is tests related to named axes and the translation of named axes to positional axes in each high-level function.
I'd be especially happy for your review of named axes propagation in indexing and broadcasting operations as they are the most complicated ones.

@agoose77
Collaborator

agoose77 commented Oct 7, 2024

@pfackeldey the xfail is failing because it's ... succeeding (xpass) 🤣

@jpivarski
Member

jpivarski commented Oct 7, 2024

I'd like to see the rendered documentation, but the Deploy Branch Preview jobs are getting skipped. I don't know why: they were running on PRs relatively recently.

Meanwhile, I'm still going over the rest of the PR.

Edit: It's because this is a fork, not a branch. That's okay; the markdown looks good; we'll see the rendered version after merging.

Member

@jpivarski jpivarski left a comment

It's hard to review such a large PR, but we went over it at length during development. I think we iterated much more quickly because of that, and so I don't have much to say here, on the PR itself. The spot-check issues below are all minor. Still, we should check in again before the final merger, since everybody working on the codebase needs to be ready to incorporate the update when it comes.

src/awkward/_namedaxis.py (resolved)
src/awkward/highlevel.py (outdated, resolved)
src/awkward/highlevel.py (outdated, resolved)
src/awkward/operations/ak_local_index.py (outdated, resolved)
src/awkward/operations/ak_singletons.py (outdated, resolved)
src/awkward/operations/ak_sort.py (outdated, resolved)
src/awkward/operations/ak_std.py (outdated, resolved)
src/awkward/operations/ak_sum.py (outdated, resolved)
src/awkward/operations/ak_unflatten.py (outdated, resolved)
@jpivarski
Member

Oh, and I should have mentioned that this is very high quality code (type annotations, docstrings, comments)! Thank you!

@pfackeldey
Collaborator Author

Thank you very much for your review @jpivarski! I'll add your suggestions soon 👍

Collaborator

@agoose77 agoose77 left a comment

A small review -- I'll pop back tomorrow.

src/awkward/highlevel.py (outdated, resolved)
src/awkward/highlevel.py (outdated, resolved)
src/awkward/highlevel.py (outdated, resolved)
src/awkward/highlevel.py (outdated, resolved)
src/awkward/highlevel.py (outdated, resolved)
src/awkward/_namedaxis.py (outdated, resolved)
src/awkward/_namedaxis.py (outdated, resolved)
src/awkward/_namedaxis.py (outdated, resolved)
src/awkward/_namedaxis.py (outdated, resolved)
src/awkward/_namedaxis.py (outdated, resolved)
@pfackeldey pfackeldey merged commit cd7d7f6 into scikit-hep:main Oct 11, 2024
45 checks passed